1) Inspection of single variables
- Summary statistics: 5-number summary, mean, variance, stem-and-leaf display, box-plot, histograms, density plots
2) Bi- and multivariate inspection
- Contingency tables, mosaic plot, scatter-plot, group stratified plots
3) Quantification of the above
- Association: suitable measures e.g. covariance, Pearson or rank correlation.
- Difference: suitable hypothesis tests, e.g. t-test, Wilcoxon signed rank test
# Libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import datetime as dt
import seaborn as sns
from pandas import Series
from scipy.stats import ttest_ind
import statsmodels.api as sm  # sm.OLS lives in statsmodels.api, not statsmodels.formula.api
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
plt.figure(figsize=(50, 50), dpi= 100, facecolor='w', edgecolor='k')
feature_df = pd.read_csv("features.csv")
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
store_df = pd.read_csv("stores.csv")
feature_df.head()
feature_df.info()
train_df.head()
train_df.info()
test_df.head()
test_df.info()
store_df.head()
store_df.info()
feature_df (8,189 rows)
train_df (421,569 rows)
test_df (115,063 rows)
# Check if Null values exist
def null_checker(dataframe_column):
    return dataframe_column.isnull().sum().sum()
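As a quick sanity check, the helper can be exercised on a tiny hypothetical frame (the `toy` frame below is illustrative, not part of the dataset):

```python
import pandas as pd
import numpy as np

def null_checker(dataframe_column):
    # Total count of NaN cells in the given Series or DataFrame.
    return dataframe_column.isnull().sum().sum()

# Toy frame with exactly two missing cells, one per column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 2.0]})
print(null_checker(toy))        # 2 -- NaNs across the whole frame
print(null_checker(toy["a"]))   # 1 -- NaNs in a single column
```

The double `.sum()` makes the helper work for both a single column (where the first `.sum()` already yields a scalar) and a whole DataFrame (where it yields per-column counts).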
| Variable | Stores | Features | Test | Train |
|---|---|---|---|---|
| Store | 0 | 0 | 0 | 0 |
| Type | 0 | | | |
| Size | 0 | | | |
| Date | | 0 | 0 | 0 |
| Temperature | | 0 | | |
| Fuel_Price | | 0 | | |
| Promotion1 | | 4158 | | |
| Promotion2 | | 5269 | | |
| Promotion3 | | 4577 | | |
| Promotion4 | | 4726 | | |
| Promotion5 | | 4140 | | |
| CPI | | 585 | | |
| Unemployment | | 585 | | |
| IsHoliday | | 0 | 0 | 0 |
| Department | | | 0 | 0 |
| Weekly_Sales | | | | 0 |
| Weekly_Sales (to be predicted) | | | 115064 | |
# Let's output the summary statistics of the tables we want to focus on
feature_df.describe()
train_df.describe()
Both the feature and train tables are analysed further using the 5-number summary and the percentage of NaN data. Fuel_Price and Unemployment stay close to their expected value (mean), whereas Promotions 1 to 5 have a high standard deviation, meaning their values are spread over a much wider range. A large part of this is down to missing data: Promotions 1 to 5 (and CPI) have many NaN entries, which distorts their summary statistics.
In the train table the Store column has a standard deviation very similar to the Store column in feature_df, which suggests the two tables cover the same stores and can be merged on that key. With over 421,000 weekly sales records, we have enough data to train an ML model to predict future sales.
I did not output the 5-number summary for the test or stores tables because they contain hardly any continuous variables that can be summarised this way; their data is mainly categorical.
# Fill NaNs with 0 for now; note this biases the means and variances of the Promotion columns downward
feature_df = feature_df.fillna(0)
A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
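A minimal numeric illustration of that statement, using two made-up samples with the same mean but very different spread:

```python
import numpy as np

# Both samples have mean 10, but the second is far more dispersed.
tight = np.array([9.0, 10.0, 11.0])
wide = np.array([0.0, 10.0, 20.0])

print(tight.mean(), wide.mean())   # 10.0 10.0
print(tight.std(), wide.std())     # ~0.82 vs ~8.16: an order of magnitude apart
```

Identical means can therefore hide entirely different distributions, which is exactly why the Promotion columns looked so spread out above.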
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5)) # 1 row, 2 columns
train_df.plot.scatter(x = "Store", y = "Weekly_Sales", color = 'DarkBlue', ax=ax1)
train_df.plot.scatter(x = "Dept", y = "Weekly_Sales", color = 'DarkBlue', ax=ax2)
plt.tight_layout()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Store', y ='Weekly_Sales', data = train_df, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales', data = train_df, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))
feature_df.plot.scatter(x = "Store", y = "Promotion1", color = 'DarkGreen', ax=ax1)
feature_df.plot.scatter(x = "Store", y = "Promotion2", color = 'DarkGreen', ax=ax2)
plt.tight_layout()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(12,5))
feature_df.plot.scatter(x = "CPI", y = "Promotion1", color = 'DarkGreen', ax=ax1)
feature_df.plot.scatter(x = "CPI", y = "Promotion5", color = 'DarkGreen',ax=ax2)
plt.tight_layout()
For Promotions 1 and 2 we see very similar trends: stores record many low-value promotions and few high-value ones. For Promotion 5 the common pattern is very low promotion values across all 45 stores, with a single outlier between 700000 and 800000.
# We will visualise continuity with line plot graphs
feature_df['Date'] = pd.to_datetime(feature_df['Date'],
format='%d/%m/%Y')
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Promotion1', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion2', data = feature_df, ax=ax2)
plt.tight_layout()
plt.show()
We have no Promotion1 or Promotion2 data for the years 2010 and 2011, then a spike in promotions during 2012 and again later in 2013.
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Promotion3', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion4', data = feature_df, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
#sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='CPI', data = feature_df, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion5', data = feature_df, ax=ax2)
plt.tight_layout()
plt.show()
# Covariance calculation
columns = ['Store', 'Temperature', 'Fuel_Price', 'Promotion1', 'Promotion2', 'Promotion3', 'Promotion4', 'Promotion5', 'CPI', 'Unemployment']
feature_df[columns].cov()
When we calculate the covariance of all the numeric columns in feature_df, we can see notable pairwise relationships between:
- Temperature and Fuel_Price
- Store, Temperature and Promotions 1 to 5
- Fuel_Price and Promotions 2, 4 and 5
- Fuel_Price, Promotions 1, 4 and 5, and CPI
- Fuel_Price, Promotion5 and CPI
- Fuel_Price, Promotions 1, 2 and 5
- Fuel_Price and Promotions 1 to 4
- Promotion2, Promotion3 and Unemployment
- Store and CPI
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive.
Given the covariance relationships above, it would be interesting to further investigate the relationship between CPI and Unemployment.
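The sign convention can be demonstrated with `np.cov` on two made-up series, constructed here purely for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = 2 * x      # moves with x      -> positive covariance
y_down = -2 * x   # moves against x   -> negative covariance

# np.cov returns the 2x2 covariance matrix; [0, 1] is the cross term.
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]
print(cov_up, cov_down)   # positive, then negative
```

Note that covariance is scale-dependent, which is why the correlation matrix computed later is easier to compare across variables with different units.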
feature_df.plot.scatter(x = "CPI", y = "Unemployment", color = 'DarkGreen')
The graphs above clearly show that the major problem with this dataset is missing values. This problem can morph into other areas and cause a lot of trouble for our future predictions and modelling. The more data we have, the better trained our models can be, which leads to more accurate results in later analysis. It is therefore worth trying to fill in some of the missing data using imputation (here via deep-learning-based techniques) and then comparing the analysis above against the filled dataset.
Promotion 5
feature_df.Promotion1
Regarding the quality of the datasets: the promotion values are mainly floating point, but what do they represent specifically? A percentage discount? An example value for Promotion1 is 53311.88, so they are clearly price reductions, though still not very informative on their own. We also have some useful categorical data, e.g. IsHoliday, which lets us split the data into holiday and non-holiday promotions. The Weekly_Sales column to be predicted is entirely null, but its row count tells us how many predictions we have to produce: 115064. We will use Weekly_Sales together with Promotions, IsHoliday and Date to forecast sales at the department level. Overall the data quality is acceptable, as many of the key variables needed for the forecast are provided (Department, IsHoliday, Weekly_Sales, Date, and so on). The main problem is the incomplete Promotion data; it is a key variable, and without it we cannot make accurate predictions of weekly sales. We do, however, have some promotion data, so we can try to estimate the missing values using methods such as imputation or time series analysis.
Features, train and store tables
Date columns to DateTime dtypes
train_df['Date'] = pd.to_datetime(train_df['Date'], format='%d/%m/%Y')
test_df['Date'] = pd.to_datetime(test_df['Date'], format = '%d/%m/%Y')
# Merging tables store_df
store_feature_merged_df = pd.merge(feature_df, store_df, on='Store')
store_feature_merged_df.head()
store_feature_merged_df.info()
# Merging tables store_feature_merged_df with train_df
# With no explicit `on`, pandas joins on all columns the two frames share
final_df = pd.merge(train_df, store_feature_merged_df)
final_df.head()
final_df.info()
final_df['Date'].min()
final_df['Date'].max()
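As a side note on the merges above: when `pd.merge` is called without `on`, pandas joins on every column the two frames share. A toy sketch with hypothetical mini-tables (the values are invented, mirroring the Store key used here):

```python
import pandas as pd

stores = pd.DataFrame({"Store": [1, 2], "Size": [100, 200]})
sales = pd.DataFrame({"Store": [1, 1, 2], "Weekly_Sales": [10.0, 12.0, 9.0]})

# Only "Store" is shared, so it becomes the join key automatically.
merged = pd.merge(sales, stores)
print(merged)   # 3 rows; columns Store, Weekly_Sales, Size
```

Passing `on` explicitly (as done for `store_feature_merged_df`) is the safer habit, since an unexpected shared column would silently change the join.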
columns = ['Weekly_Sales', 'Temperature', 'Fuel_Price', 'Promotion1', 'Promotion2', 'Promotion3', 'Promotion4', 'Promotion5', 'CPI', 'Unemployment']
final_df[columns].cov()
# Pairwise correlation: seeing which pair is more relevant or which are the same etc.
final_df[columns].corr()
It is clear from the table above that the promotions correlate with Weekly_Sales the most. This matches intuition about customer behaviour: we are more inclined to purchase when promotions are available than when they are not.
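To make the correlation reading concrete, here is a toy frame (invented values) where one column is a perfect linear function of the other, so `DataFrame.corr()` reports a Pearson coefficient of 1.0:

```python
import pandas as pd

df = pd.DataFrame({
    "promo": [0.0, 1.0, 2.0, 3.0],
    "sales": [10.0, 12.0, 14.0, 16.0],   # sales = 10 + 2 * promo, perfectly linear
})
print(df.corr())   # off-diagonal entries are 1.0
```

Real columns land somewhere in [-1, 1]; unlike covariance, this scale makes pairs with different units directly comparable.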
# Difference test: t-test for holidays and non holidays sales:
holiday_true = final_df[final_df['IsHoliday'] == True]
holiday_false = final_df[final_df['IsHoliday'] == False]
ttest_ind(holiday_true['Weekly_Sales'], holiday_false['Weekly_Sales'])
The t-value measures the size of the difference relative to the variation in the sample data; the accompanying p-value indicates how likely such a difference would be if holiday and non-holiday sales had the same mean.
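A small synthetic sketch of how `ttest_ind` behaves when two groups really do differ in mean (the samples below are simulated, standing in for holiday and non-holiday weeks):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
a = rng.normal(loc=100.0, scale=10.0, size=500)  # stand-in for holiday weeks
b = rng.normal(loc=90.0, scale=10.0, size=500)   # stand-in for non-holiday weeks

stat, pvalue = ttest_ind(a, b)
print(stat, pvalue)
# A large positive t and a tiny p-value: the 10-unit mean gap is very
# unlikely to arise by chance under the equal-means null hypothesis.
```

Note that `ttest_ind` assumes equal variances by default; with groups as unbalanced as the holiday/non-holiday split, passing `equal_var=False` (Welch's t-test) is often the more defensible choice.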
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
regression_before_imputation = pd.concat([final_df, test_df])
regression_before_imputation = regression_before_imputation.fillna(0)
# Use min max scaler
scaler = MinMaxScaler()
DF_Scaled_1 = scaler.fit_transform(regression_before_imputation[columns])
DF_Scaled_1 = pd.DataFrame(data=DF_Scaled_1, columns=regression_before_imputation[columns].columns)
# Make X and y
y = DF_Scaled_1['Weekly_Sales']
DF_Scaled_1 = DF_Scaled_1.drop('Weekly_Sales', axis = 1)
X = DF_Scaled_1
# Intercept column (X references DF_Scaled_1, so it picks this up too)
DF_Scaled_1['_intercept'] = 1
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ols_model = sm.OLS(np.array(y_train), np.array(X_train))
# Fit the model
result = ols_model.fit()
print(result.summary())
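For reference, `MinMaxScaler` simply maps each column onto [0, 1] via (x - min) / (max - min); a minimal demonstration on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Single toy column spanning 1..5.
X = np.array([[1.0], [3.0], [5.0]])
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())   # [0.  0.5 1. ]
```

Because the transform is per-column, features with wildly different ranges (Weekly_Sales vs Fuel_Price) end up on the same footing before the regression.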
# Here we will see if any variable affects weekly_sales.
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='CPI', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Fuel_Price', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='Promotion1', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Promotion2', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
final_df.plot.scatter(x='Promotion3', y='Weekly_Sales', ax=ax1, alpha = 0.2)
final_df.plot.scatter(x='Promotion4', y='Weekly_Sales', ax=ax2, alpha = 0.2)
plt.tight_layout()
%matplotlib inline
final_df.plot(kind = "scatter", x = 'Promotion5', y = 'Weekly_Sales', alpha = 0.2, figsize=(20,8))
plt.tight_layout()
Let us have a look at sales fluctuations for the four holidays (Super Bowl, Labor Day, Thanksgiving and Christmas) using time series analysis.
mask = (final_df['Date'] >= '2010-02-05') & (final_df['Date'] <= '2010-02-19')
super_bowl_2010 = final_df.loc[mask]
mask_1 = (final_df['Date'] >= '2011-02-04') & (final_df['Date'] <= '2011-02-18')
super_bowl_2011 = final_df.loc[mask_1]
mask_2 = (final_df['Date'] >= '2012-02-03') & (final_df['Date'] <= '2012-02-17')
mask_3 = (final_df['Date'] >= '2013-02-01') & (final_df['Date'] <= '2013-02-15')
super_bowl_2012 = final_df.loc[mask_2]
super_bowl_2013 = final_df.loc[mask_3]
columns = ['Date', 'Weekly_Sales']
super_bowl_2010 = super_bowl_2010[columns]
super_bowl_2011 = super_bowl_2011[columns]
super_bowl_2012 = super_bowl_2012[columns]
super_bowl_2013 = super_bowl_2013[columns]
%matplotlib inline
# 2013 falls outside the training data, so we plot the 2010-2012 Super Bowl windows and focus on 2010 for the other holidays.
#super_bowl_2010['Weekly_Sales'].plot(linewidth=0.5)
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2010)
plt.tight_layout()
plt.show()
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2011)
plt.tight_layout()
plt.show()
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = super_bowl_2012)
plt.tight_layout()
plt.show()
labor_day_mask = (final_df['Date'] >= '2010-09-03') & (final_df['Date'] <= '2010-09-17')
thanksgiving_mask = (final_df['Date'] >= '2010-11-19') & (final_df['Date'] <= '2010-12-01')
christmas_mask = (final_df['Date'] >= '2010-12-24') & (final_df['Date'] <= '2011-01-07')
labor_2010 = final_df.loc[labor_day_mask]
thanksgiving_2010 = final_df.loc[thanksgiving_mask]
christmas_2010 = final_df.loc[christmas_mask]
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = labor_2010)
plt.tight_layout()
plt.show()
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = thanksgiving_2010)
plt.tight_layout()
plt.show()
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.lineplot(x = 'Date', y ='Weekly_Sales', data = christmas_2010)
plt.tight_layout()
plt.show()
R2_score to see how accurate the imputed values are compared with the training data.
# Libraries used
import datawig
import pandas as pd
from sklearn.metrics import r2_score as score
# After imputing final_df.csv we clean the data by rounding to 1 decimal place and filling the null values so
# that the r2_score metric does not throw an error about float precision or null values.
df = pd.read_csv('final_df.csv')
df = df.round(1)
df = df.fillna(0)
# Split the dataset into training and test data, 80% / 20%
df_train, df_test = datawig.utils.random_split(df, split_ratios=[0.8, 0.2])
# Creating the SimpleImputer object, the input columns are the columns we believe are relevant in calculating the
# Promotions. Output_column is the column we are imputing the values for. The output_path is the output log.
imputer = datawig.SimpleImputer(
input_columns = ['Weekly_Sales', 'Temperature', 'Fuel_Price',
'Promotion5', 'Promotion4', 'CPI', 'Promotion2', 'Promotion3'],
output_column = 'Promotion1',
output_path = 'imputer_model'
)
# We fit the training data and state the number of epochs
imputer.fit(train_df=df_train, num_epochs = 50)
# The imputed dataframe is produced.
imputed = imputer.predict(df_test)
# We calculate the r2_score between the observed Promotion1 and the imputed Promotion1
r2 = score(imputed['Promotion1'], imputed['Promotion1_imputed'])
print('Promotion1 r2_score: ', r2)
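As a reminder of how to read this metric: `r2_score` is 1.0 for a perfect reconstruction and 0.0 for a predictor no better than the mean (the values below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = y_true.copy()
mean_only = np.full_like(y_true, y_true.mean())

print(r2_score(y_true, perfect))    # 1.0 -- imputations recover the data exactly
print(r2_score(y_true, mean_only))  # 0.0 -- no better than predicting the mean
```

Negative values are also possible and would indicate the imputer performs worse than that mean baseline; argument order matters, with the ground truth first.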
# Let us now read in our csv files with imputed columns.
Promotion1_imputed = pd.read_csv('Promotion1_imputed.csv')
Promotion2_imputed = pd.read_csv('Promotion2_imputed.csv')
Promotion3_imputed = pd.read_csv('Promotion3_imputed.csv')
Promotion4_imputed = pd.read_csv('Promotion4_imputed.csv')
Promotion5_imputed = pd.read_csv('Promotion5_imputed.csv')
CPI_imputed = pd.read_csv('CPI_imputed.csv')
Promotion1_imputed['Promotion2_imputed'] = Promotion2_imputed['Promotion2_imputed']
Promotion1_imputed['Promotion3_imputed'] = Promotion3_imputed['Promotion3_imputed']
Promotion1_imputed['Promotion4_imputed'] = Promotion4_imputed['Promotion4_imputed']
Promotion1_imputed['Promotion5_imputed'] = Promotion5_imputed['Promotion5_imputed']
Promotion1_imputed['CPI_imputed'] = CPI_imputed['CPI_imputed']
final_df = Promotion1_imputed
final_df['Date'] = pd.to_datetime(final_df['Date'],
format='%Y-%m-%d')
final_df.describe()
final_df = final_df.drop(axis = 1, columns = 'Unnamed: 0')
final_df = final_df.drop(axis = 1, columns = 'Unnamed: 0.1')
final_df.head()
final_df['Date'].min()
final_df['Date'].max()
Now that we have imputed values for Promotions and CPI, we can use the extra data to train another model to predict Weekly_Sales.
Predictions of Weekly_Sales
final_including_ws = pd.read_csv('Weekly_Sales_imputed.csv')
final_including_ws.info()
final_including_ws = final_including_ws.drop(axis = 1, columns = 'Unnamed: 0')
final_including_ws = final_including_ws.drop(axis = 1, columns = 'Unnamed: 0.1')
final_including_ws.head()
final_including_ws['Date'] = pd.to_datetime(final_including_ws['Date'],
format='%Y-%m-%d')
final_including_ws['Date'].min()
final_including_ws['Date'].max()
final_including_ws = final_including_ws.sort_values(by = 'Date')
final_including_ws.hist(figsize=(20,20))
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
final_df_v2 = pd.read_csv("2013_sales_complete.csv")
final_df_v2.info()
final_df_v2 = final_df_v2.fillna(0)
final_df_v2 = final_df_v2.drop('Unnamed: 0', axis = 1)
final_df_v2 = final_df_v2.drop('Unnamed: 0.1', axis = 1)
final_df_v2['Date'] = pd.to_datetime(final_df_v2['Date'],
format='%Y-%m-%d')
final_df_v2['Date'].min()
final_df_v2['Date'].max()
columns_1 = ['Store', 'Fuel_Price', 'Dept','Weekly_Sales', 'Temperature', 'Promotion1', 'Promotion2', 'Promotion3',
'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed',
'Promotion4_imputed', 'Promotion5_imputed', 'Unemployment', 'Size (sq ft)', 'Weekly_Sales_imputed', 'Weekly_Sales to be predicted', 'CPI_imputed', 'CPI']
# Use min max scaler
scaler = MinMaxScaler()
DF_Scaled = scaler.fit_transform(final_df_v2[columns_1])
DF_Scaled = pd.DataFrame(data=DF_Scaled, columns=final_df_v2[columns_1].columns)
DF_Scaled = DF_Scaled.fillna(0)
DF_Scaled.head()
# Make X and y
y = DF_Scaled['Weekly_Sales_imputed']
DF_Scaled = DF_Scaled.drop('Weekly_Sales_imputed', axis = 1)
X = DF_Scaled
# Intercept column (X references DF_Scaled, so it picks this up too)
DF_Scaled['_intercept'] = 1
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# No separate imputation step is needed here: NaNs were already filled with 0 above
ols_model = sm.OLS(np.array(y_train), np.array(X_train))
# Fit the model
result = ols_model.fit()
from IPython.display import Image
Image("OLS_results.png")
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
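These three error metrics follow directly from their definitions; a toy check (invented numbers) confirming the sklearn results against the hand-computed formulas:

```python
import numpy as np
from sklearn import metrics

y_test = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

mae = metrics.mean_absolute_error(y_test, y_pred)   # mean of |error|
mse = metrics.mean_squared_error(y_test, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                                 # back in the units of y

err = y_test - y_pred
print(mae, mse, rmse)   # mae = 0.5 here
```

RMSE penalises large errors more heavily than MAE, which matters for weekly sales where holiday spikes can dominate; since the targets were min-max scaled, these errors are on the [0, 1] scale rather than in dollars.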
%matplotlib inline
"""
We will now use pairplot from the Seaborn library which allows us to draw statistical plots based
on Matplotlib. Once we visualise the data it would be easier to detect outliers.
"""
# This is the subset of columns I will be focusing on, these can be changed to include other columns as well.
columns_2 = ['Fuel_Price', 'Dept','Weekly_Sales', 'Temperature', 'Promotion1', 'Promotion2', 'Promotion3',
'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed',
'Promotion4_imputed', 'Promotion5_imputed', 'Size (sq ft)', 'Weekly_Sales to be predicted', 'CPI_imputed']
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.set_palette(flatui)
sns.pairplot(final_df_v2[columns_2], height = 5.0)
plt.tight_layout()
plt.show()
columns_3 = ['Date','Fuel_Price', 'Dept','Weekly_Sales', 'Temperature', 'Promotion1', 'Promotion2', 'Promotion3',
'Promotion4', 'Promotion5', 'Promotion1_imputed', 'Promotion2_imputed', 'Promotion3_imputed',
'Promotion4_imputed', 'Promotion5_imputed', 'Size (sq ft)', 'Weekly_Sales to be predicted', 'Weekly_Sales_imputed', 'CPI_imputed']
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Temperature', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Size (sq ft)', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Promotion1_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion2_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Date', y ='Promotion3_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Date', y ='Promotion4_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()
%matplotlib inline
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(20,8))
sns.lineplot(x = 'Store', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax1)
sns.lineplot(x = 'Dept', y ='Weekly_Sales_imputed', data = final_df_v2, ax=ax2)
plt.tight_layout()
plt.show()